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Abstract 

In  this  paper,  we  propose  new  algorithms  for  computing  the  Dis¬ 
crete  Hartley  and  the  Discrete  Cosine  Transform.  The  algorithms  are 
based  on  iterative  applications  of  the  modified  small  n  algorithms  of 
DFT.  The  one  dimensional  transforms  are  mapped  into  two  dimen¬ 
sions  first  and  then  implemented  on  two  dimensional  systolic  arrays. 
Pipelined  bit  serial  architectures  operating  on  left  to  right  LSB  to 
MSB  binary  arithmetic  is  the  basis  of  the  hardware  design.  Different 
hardware  schemes  for  implementing  these  transforms  are  studied.  We 
show  that  our  schemes  achieve  a  substantial  speed-up  over  existing 
schemes. 


Supported  in  part  by  NSA  Contract  No.  MDA-904-85H-0015,  NSF  Grant  No.  DCR- 
86-00378  and  by  the  Systems  Research  Center  Contract  No.  OIR-85-00108 


1  Introduction 


The  Discrete  Cosine  Transform  (DCT)  has  emerged  as  the  most  popular  trans¬ 
form  for  many  coding  schemes.  This  is  due  to  the  fact  that,  for  a  wide  class  of 
signals,  it  performs  very  closely  to  the  statistically  optimal  Karnuhen-Loeve  Trans¬ 
form  (KLT).  Its  energy  packing  efficiency  is  also  greater  than  any  other  transform. 
Since  the  DCT  block  is  an  integral  part  of  any  image  processing  scheme,  a  study  of 
its  implementation  in  VLSI  is  very  important.  Another  popular  scheme  is  the  Dis¬ 
crete  Hartley  Transform  (DHT)  which  may  replace  the  Discrete  Fourier  Transform 
(DFT)  in  many  spectral  analysis  and  digital  processing  schemes.  The  advantage 
is  that  the  computation  of  this  transform  does  not  require  complex  arithmetic  and 
could  be  done  much  faster. 

There  exist  many  methods  for  computing  the  one-dimensional  DCT.  They 
are  either  based  on  direct  factorisation  of  the  DCT  matrix  [Lee],  [CHF],  or  on  a 
rotation  of  the  output  of  a  Fourier  [NP],  [VN]  or  Hartley  [Mai]  transform.  There  are 
many  methods  [Br],  [PW],  [SJBH],  [Hou]  for  computing  fast  DHT  in  one  dimension 
too.  However,  there  are  no  known  VLSI  implementations  of  these  algorithms.  The 
straightforward  approach  would  be  to  implement  these  algorithms  on  a  linear  array 
of  inner  product  processors.  The  result  is  complicated  control  and  arithmetic  units. 

In  this  paper  we  propose  new  algorithms  to  compute  DHT  and  DCT  along 
with  their  implementations  on  fully  pipelined  bit  serial  two  dimensional  systolic 
arrays.  We  map  the  one  dimensional  transforms  into  two  dimensions  such  that 
N  computations  are  roughly  replaced  by  y/N  subcomputations  which  can  be  done 
fully  in  parallel.  In  the  computation  of  DCT(N),  we  suggest  four  schemes  based  on 
whether  DFT(v/7V)  or  DHT(-v/V)  is  used  for  the  subcomputations  on  the  rows  and 
on  the  columns.  A  comparison  of  these  schemes  lead  us  to  claim  that  the  scheme 
based  on  DHT-DHT  is  the  best.  Moreover,  we  have  improved  upon  the  method 
suggested  by  Sorensen  et.al.  to  compute  the  two-dimensional  DHT.  The  hardware 


design  of  all  these  consists  of  a  few  types  of  regular  and  modular  components. 
The  compatibility  of  the  input/output  characteristics  makes  the  interconnections 
between  the  components  simple.  The  delay  is  small  and  the  throughput  is  very 
high. 

The  rest  of  the  paper  is  organized  as  follows.  In  Section  2,  we  discuss  the 
index  mappings  for  two-dimensional  DFT,  DHT  and  DCT  and  the  various  schemes 
for  their  computation.  Sections  3  and  4  deal  with  the  systolic  array  implementation 
of  the  proposed  schemes.  We  introduce  the  word  design  in  Section  3  and  the  bit 
serial  design  in  Section  4.  We  conclude  the  paper  in  Section  5  with  a  comparison 
of  the  various  schemes  for  computing  DHT  and  DCT  and  an  estimate  of  the  sizes 
and  delays  of  the  various  components. 

2  New  Two  Dimensional  Schemes 

Any  one  dimensional  linear  transform  f(k)  =  a(k,n)x(n), 

0  <  n,k  <  N  —  1,  can  be  mapped  into  a  two  dimensional  transform.  Let  N  —  JVX JV2 
and  let 

n  =  Kirii  +  K2112  mod  N  =  K\ti\  mod  N  +  K2112  mod  N  —  anin2N  (1) 
k  =  Kzk\  +  K^k2  mod  N  =  K3k\  mod  N  +  mod  N  —  (2) 

where  Ki,K2,Kz,K4  are  integer  constants,  anin3,bklk,  are  either  0  or  1, 

0  <  1 ij,  k\  <  Nx  —  1,  0  <  n2,  k2  <  N2  —  1* 

Then  f(k)  can  be  written  in  the  form 

f(ki,k2)  ='$212  a(k  i.*2,ni.»2)£(»i,»3).  (3) 

r»3  r»j 

The  main  motive  behind  this  transformation  is  to  map  the  computation  into  a  sys¬ 
tolic  two  dimensional  array  of  size  Ni  x  N2.  The  number  of  systolic  steps  will  be 
drastically  reduced  from  O(iV)  to  0(Ni  +  N2).  The  success  of  this  transformation 
depends  on  choosing  K\,  K2,  K3,  K4  in  such  a  way  that  equation  (3)  can  indeed 
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be  computed  by  one  or  few  Ni  x  N2  systolic  arrays.  Our  goal  in  this  section  is  to 
establish  such  transformations  for  the  one  dimensional  Discrete  Fourier  Transform 
(DFT),  Discrete  Hartley  Transform  (DHT)  and  Discrete  Cosine  Transform  (DCT). 
We  will  also  sketch  the  procedures  to  execute  the  two  dimensional  computations. 
The  detailed  calculations  are  included  in  Appendices  1,2,3  and  4.  We  have  modi¬ 
fied  an  existing  scheme  for  DHT.  The  schemes  for  DCT  have  not  appeared  in  the 
literature  before. 

2.1  Discrete  Fourier  Transform 

The  one  dimensional  Discrete  Fourier  Transform  of  N  points,  DFT(N), 
is  defined  by 

F(k)  =  4=52  0  <k<N,  (4) 

V"  n=0 

where  Wn  =  exp(j^nk).  Equation  (4)  can  then  be  reduced  to 

=  (s) 

\  N  ni  n2 

where  n  and  k  are  defined  as  in  (1)  and  (2).  The  choice  of  K\  =  N2,  K2  =  iVi, 
K$  =  (iV2_1  mod  Ni)N2,  K\  =  (Wf1  mod  N2)NX,  where  N  =  NiN2  and  N\,  N2  are 
mutually  prime,  gives  a  two  dimensional  DFT  of  the  form 

F(kuk,)  =  -j=  p  [E  4(m,  n2)W^  |  WJg**  (6) 

The  derivation  is  given  in  [Bu].  It  is  clear  that  expression  (6)  consists  of  a  set  of 
column  DFT’s  followed  by  a  set  of  DFT’s  on  the  rows. 

We  now  turn  to  the  one  dimensional  DFT  computation  of  columns  or  rows. 
The  most  popular  method  of  computing  DFT  is  by  Fast  Fourier  Transform  [CT] 

e 

which  uses  0(N  log  N)  operations  as  opposed  to  0(N2)  operations  required  by 
algorithms  for  DFT(N).  Recently,  faster  algorithms,  based  on  the  so  called  small 
n  algorithms,  have  been  derived  by  Winograd  and  others  [AC],  [Gd],  [Sil],  [Win]. 
For  small  values  of  N,  these  algorithms  use  the  fewest  number  of  multiplications. 
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Each  such  algorithm  can  be  expressed  as  Y  =  SnCnTnx,  where  Sn  is  an  iV  x  J 
incidence  matrix,  2V  is  a  J  x  N  incidence  matrix,  is  a  J  X  J  diagonal  matrix 
with  purely  real  or  purely  imaginary  entries,  J  «  N.  Therefore,  a  two  dimensional 
DFT  of  size  N\  X  N2  can  be  expressed  as 

Y  =  SniC^T^  (Sn1Cn1Tn1x)t  .  (7) 

[OJ]  have  provided  a  systolic  implementation  based  on  three  types  of  components, 
namely  summation  component  ( Z  =  TX),  scaling  component  (Z  =  CX),  and 
transpose  component  ( Z  =  XT).  The  interconnection  is  shown  in  Fig.l. 


Fig.l  Interconnection  of  the  various  components  for  DFT 


2.2  Discrete  Hartley  Transform 

The  one  dimensional  Discrete  Hartley  Transform  of  N  points,  DHT(N), 
is  defined  by 


1  27T 

H[k)  =  —7=  5^  x(n)cas  —nk,  0  <  k  <  N  —  1, 

K  ’  Vn  ^0  N  ~  ~ 

where  cas  (x)  =  cos  (x)  +  sin  (x) . 

If  we  define  two  dimensional  arrays  H  and  x,  (8)  can  be  reduced  to 

H{ki,k2 )  =  ~fink' 


(8) 


(9) 


ni  nj 


where  n  and  k  are  defined  as  in  (1)  and  (2).  If  N  =  N1N2  and  Ni,  N2  are  relatively 
prime,  a  choice  of  Ki  —  N2,  K2  =  N\,  K$  =  {N2X  mod  Ni)N2, 

Ki  =  (Ni1  mod  Nt)Ni  gives  a  two  dimensional  DHT  of  the  form 
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H(ki,k2)  =  ^^X)|S*(ni>ri2)ca5|^nifci| 


y/N 
2 

VN 


2lx  u 
cas  —n2k2 

N2 


V  |  Y'x(ni,n2)sin  ^ l  sin  n2k2 .  (10) 

na  Ni  )  N2 


The  derivation  is  given  in  Appendix  1A.  From  expression  (10)  it  is  obvious,  that 
this  is  not  a  simple  DHT  of  the  columns  of  x(nls  n2 )  followed  by  a  DHT  of  the  rows. 
Let  A(kx,n2)  =  x(n1,n2)cas  —n^,  and 

B(k1,n2)  =  | [A(kun2)  +  A(Ni  -  kun2)  +  A{kltN2  -  n2)  -  A(Ni  -  h,N2  -  n2)], 


then 

^  27T 

H{ku  k2)  =  Y^,  B(ki,n2)cas  —n2k2.  (11) 

n3  -^2 

Thus  two  dimensional  DHT  can  be  obtained  [SJBH]  by  computing  the  DHT  of  the 
columns  followed  by  some  adjustments  to  produce  B(ki,n2)  and  then  computing 
the  DHT  of  the  rows.  For  details  see  Appendix  IB. 

One  dimensional  DHT  can  be  produced  by  various  schemes.  We  choose 
the  Winograd-Hartley  transform  algorithm  since  the  number  of  multiplications 
is  minimum.  The  algorithm  can  be  expressed  as  H  =  S^CnTnx,  where  C  = 
i2e[Cjv]  —  Sn,  Tn,  Cn  are  the  matrices  appearing  in  DFT(N).  Since  Cn 

A 

contains  purely  real  or  purely  imaginary  entries,  Cjy  differs  from  in  that  the 
imaginary  entries  are  negated.  Sorensen  et.al.  [SJBH]  claim  that  they  can  compute 
this  replacement  in  place  with  eight  additions  for  four  output  points.  In  Appendix 
4,  we  show  how  matrices  Sn  and  Xjv  can  be  modified  and  how  the  rows  and  columns 
of  x  can  be  permuted  so  that  the  replacement  can  be  computed  with  four  additions 
for  four  output  points.  Two  dimensional  DHT  can  be  computed  using  summation, 
scaling,  transpose  and  adder  components.  The  interconnection  is  shown  in  Fig.2A. 

A  simpler  scheme  would  be  to  compute  two  dimensional  DFT  and  then  sub¬ 
tract  the  imaginary  part  from  the  real  part  to  get  two  dimensional  DHT.  Since  we 
already  have  good  schemes  to  compute  two  dimensional  DFT,  the  only  hardware 


5 


extension  necessary  is  a  row  of  subtractors.  The  interconnection  is  shown  in  Fig.2B. 
The  disadvantage  of  this  method  is  that  we  are  not  utilising  the  fact  that  DHT  is  a 
real  transform,  instead,  we  are  resorting  to  complex  adders  and  multipliers  and  then 
with  the  help  of  another  subtractor  unit  retrieving  DHT.  In  the  next  two  sections, 
we  will  see  that  this  scheme  fares  poorly  compared  to  the  first  scheme  both  in  terms 
of  hardware  as  well  as  delay. 


DHT(X) 


Fig.2A  Interconnection  of  the  various  components  for  Scheme  1  of  DHT 


Fig.2B  Interconnection  of  the  various  components  for  Scheme  2  of  DHT 
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2.3  Discrete  Cosine  Transform 


The  one  dimensional  Discrete  Cosine  Transform  of  N  points,  DCT(N), 
is  defined  [ANR]  by 


°(k)  =  jj  XT  x{n)cos  [^r(2«  + 

C(°)  =  -TT  £  X(n)‘ 

iV  n=0 

l)Jb] ,  l<k<N-l 

(12) 

We  define  a  new  series  y(n)  by 

y(n)  =  x(2n) 

N 

0  <n< - 1 

-  -  2 

=  z(2iV  —  2n  —  1) 

N 

—  <n<  N  -  1. 

2  ~  ~ 

(13) 

With  this  substitution  equation  (12)  reduces  to 

c(k)  =  TF  £  y(n)cos 

JV  n=0 

2*f(4n+1H 

(14) 

c(°)  =  v(n)- 

n=0 

(15) 

Applying  the  same  procedure  as  in  Sections  2.1  and  2.2,  we  define  two  dimensional 

arrays  C(ki,k2)  and  y(ni,n2).  Equation  (14)  then  reduces  to 

C{kuk2)  =  ^ Y,Y,y(nun2)cos  ^(4n  +  l)k  , 

6(0,0)  =  (16) 
rij  f»i 

where  n  and  k  are  defined  as  in  (1)  and  (2).  Since  computation  of  (7(0, 0)  is 
straightforward,  we  shall  not  treat  it  here.  We  define  two  functions  s(fcx)  and  t(k2) 
as  follows  s(ki )  =  {K$k\  mod  N)/(2N2ki),  t(k2)  =  ( K\k2  mod  N)/(2Nik2). 

If  Ni  and  N2  are  mutually  prime  to  each  other,  a  choice  of  K\  =  N2,K2  =  JVi, 

Kz  =  (JV2_1  mod  Ni)N2,  K4  =  (iVf1  mod  N2)Ni  gives  a  two  dimensional  DCT 


^  2  ^  7T  ft  TX  ' 

C(kuk2)  =  -r:^2J2y{nun2)cos  —{2n1+s(ki))ki  +  —(2  n2  +  t(k2))k2  -  -bklk7  . 
Jy  nj  m  L-^1  2 

(17) 
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The  derivation  is  included  in  Appendix  2. 

When  bklk2  =  0 

C(ki,k2)  =  -^-(271!  +  s(fcx))A;i  |  cos  —  (2n2  +  t(k2))k2 

-  ^^^£y{niin2)sin  ^-(2ni  +  s(fci))fci  |  sin  jjr(2n2  +  t(k2))k2  . 

(18) 

When  bkuk3  =  1 

C(kUk2)  =  |x)y(nl’n2)c0s  [^(2ni  +  S(kl))kl  |  sin  ]y^(2n2  +  f(fc2))fc2 

+  [^-(2^  +  5(M)fci  }c°5  j^{2n2  +  t(k2))k2  . 

(19) 

Thus  the  two  dimensional  DCT  can  be  obtained  by  applying  one  dimensional 
transforms  on  columns  followed  by  one  dimensional  transforms  on  rows.  Though 
the  expressions  for  bklik2  =  0  or  1,  seem  very  different,  they  are  actually  not  so. 
We  will  show  in  Section  3,  how  a  small  manipulation  in  the  last  phase  is  enough  to 
realise  the  two  different  expressions. 

When  bklM  =  0 

C(kuh)  =  I  { DCTn ,  [•DCr„1[5(>H,n2)]]  -  DSf„ ,  [psl.lSK  »,)]]}  (20) 

When  bklik3  =  1, 

C(kuk2)  =  jj  {DSfn,  [PCA.IStm.n,)]]  +DCf„,  [PSf„, [»(»„«,)]]},  (21) 

where  DCT(DST)  differs  from  DCT(DST)  in  the  cosine(sine)  argument. 

We  will  first  discuss  the  popular  methods  of  computing  one  dimensional  DCT. 
These  methods  use  either  one  dimensional  DFT  or  one  dimensional  DHT  on  per¬ 
muted  data.  Let 

YF(k)  =  -1=Z  y(n)W$  (22) 

v"  n=0 
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where  y(n )  is  defined  as  in  equation  (13).  C(k )  can  then  be  expressed  as 


C(k) 


2  „  r 

—=Re  e 

y/N  1 


jirk 
2  N 


Yr(k)] . 


(23) 


Thus  a  computation  of  one  dimensional  DFT  followed  by  a  complex  multiplication 
gives  one  dimensional  DCT.  Let 


Y„{k)  = 


1 

7n 


N- 1 

^2  y(n)cas 

n=0 


(24) 


then 


C(k)  = 


Vn 


Yn(k)cas 


+  Yh(N  —  k)cas 


(25) 


Thus  a  computation  of  one  dimensional  DHT  followed  by  real  multiplications  gives 
one  dimensional  DCT. 

Let  Zc(k)  =  DCTn  =  Y.ny{n)cos  jj&n  +  g{k))k  and 
Zs(k )  =  DSTn  —  y{n)sin  ^(2 n+g(k))k,  where  g(k)  is  a  predetermined  function. 

We  need  to  know  how  to  compute  them  in  order  to  be  able  to  compute  the  two 
dimensional  DCT.  Both  DFT  and  DHT  can  be  used  for  this  computation. 

If  we  use  DFT  then, 


Zc(k)  =  Re[e-^k3^YP{k)] 

Zs(k)  =  -Im  [e-&kaWYF{kj\  .  (26) 

If  we  use  DHT  then, 

Zc{k)  =  |  |rH(fc)cas  (--^fcflr(fc))  +  YH(N  -  k)cas 

Zs{k)  =  i  [yH(fc)cas  (jjk9{k)j  ~  Yh{N  -  k)cas  •  (27) 


For  systolic  implementation,  it  is  necessary  for  Yjj{k)  and  Yjj{N  —  k)  to  be  adjacent 
so  that  Zc{k )  and  Zs{k)  can  be  computed  with  constant  time  delay,  irrespective  of 
the  value  of  k.  In  Appendix  3,  we  show  how  to  modify  matrix, Sn,  so  that  this  is 
possible. 
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Since  two  dimensional  DCT  is  a  function  of  Zc(ki),  Zs(ki),  Zc(k 2),  Zs(Ar2),  we 
can  use  either  one  dimensional  DFT  or  DHT  on  the  rows  as  well  as  on  the  columns 
and  then  adjust  using  multipliers  and  adders. 

We  propose  the  following  algorithm 

•  Compute  DFT  or  DHT  on  columns. 

•  Adjust  using  complex  or  real  multipliers  and  adders. 

•  Compute  DFT  or  DHT  on  rows. 

•  Adjust  using  complex  or  real  multipliers  and  adders. 

The  interconnection  is  shown  in  Fig.3.  There  axe  thus  four  possible  schemes  based 
on  whether  DFT  or  DHT  is  used  on  columns  and  on  rows,  DFT-DFT,  DFT-DHT, 
DHT-DFT,  DHT-DHT.  We  shall  discuss  them  in  the  next  section.  Since  both  DCT 
and  DHT  are  real  transforms,  we  would  expect  to  do  better  if  we  used  DHT  to 
compute  DCT.  In  the  subsequent  sections  we  will  show  that  this  is  true. 


Fig.3  Interconnection  of  the  various  components  for  DCT 
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3  Systolic  Implementations 


In  this  section  we  discuss  the  systolic  array  implementations  of  the  two  dimen¬ 
sional  schemes  of  DFT,  DHT  and  DCT.  The  scheme  [OJ]  for  DFT  uses  only  three 
types  of  components,  summation,  scaling  and  transpose.  Our  schemes  for  DHT 
and  DCT  use  these  components  along  with  special  type  of  multipliers  and  adders. 
Here  we  present  the  functional  definitions  and  the  data  flow  through  the  various 
components. 

3.1  DFT  implementation 

Summation  Component:  This  component  performs  the  operation  Z  —  SX, 

where  S  is  an  incidence  matrix  of  size  M  x  JVi,  X  is  the  input  matrix  of  size  N\  x  iV2, 
and  Z  is  the  output  matrix  of  size  M  X  JV2.  Since  S  is  known,  the  elements  of  S 
can  be  directly  embedded.  The  nature  of  S  makes  it  possible  for  the  summation 
component  to  be  constructed  with  three  different  types  of  subcomponents,  addition 
(sij  =  1),  delay  (s,-,-  =  0)  and  subtraction  (s<y  =  —1)  subcomponents.  The  functional 
description  of  each  subcomponent  is  given  in  Fig.4. 


Addition 
S  ubcomponent 

Delay 

Subcomponent 

Subtraction 

Subcomponent 


Xou  t(ijj)  —  X{n(i ,  j) 

Zout{iJ)  —  Zin{i>j )  +  Xin(i,  J ) 

Xo»t(i,j)  =  Xin(i,j) 

Zout{i)j)  =  Zin[i)j) 

X0ut  J  )  = 

Zout{i,j)  =  Zin(i,j)  —  Xin(i,j ) 


Xb,  -» 


’  Xont 


Fig.4  Function  description  of  the  various  subcomponents 
Xin  and  Zin  are  the  values  prior  to  some  clocking,  while  Xout  and  Zout  are  the 
values  generated  after  the  clocking.  The  three  different  types  of  subcomponents  are 
interconnected  as  shown  in  Fig.5. 
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Fig.5  Data  flow  through  the  summation  component 
Scaling  component :  This  component  performs  the  operation  Z  =  CX,  where 

C  is  a  diagonal  matrix  of  size  Ni  X  Ni,  X  and  Z  are  the  input  and  output  matrices 
of  size  Ni  x  N2.  Since  the  elements  are  known  beforehand,  they  can  be  built 
into  this  component.  The  function  description  of  this  subcomponent  is 
Zout{i,j)  —  C(i,i)Xin(i,j). 


Fig.6  Data  flow  through  the  scaling  component 
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A  linear  array  of  such  subcomponents  form  this  component.  The  skewed  data 
through  this  component  are  shown  in  Fig. 6.  Notice  that  the  data  flow  is  consistent 
with  that  of  the  summation  component. 

Transpose  component:  This  component  performs  the  operation  Z  —  X7, 

where  X  and  Z  are  the  input  and  output  matrices  of  sizes  N%  x  iV2  and  N2  x  Ni 
respectively.  The  functional  description  of  each  subcomponent  is  given  in  Fig.7. 


if  C(i,j)  >  0  then 

Xout(ifj)  =  Zout(i,  j)  Yout  (t ,  j )  X{n  ( *)  J ) 

else  if  C{i,j)  <  0  then 

Xout (i) j)  ~  Zout (i, j )  —  FoUt(t,j)  Yin(i,j) 

else 

Xout  (tj  j)  =  Zout(i,j)  —  Yout(i,j)  Z{n(i,  j ) 
endif 


Fig.  7  Function  description  of  transpose  subcomponent 


Fig.8  Data  flow  through  the  transpose  component 
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The  control  flag  Cij  are  set  according  to 
if  ( k  -  j )  <  (n  -  1)  -  j  then  Citj  =  1 
else  if  ( k  —  j)  >  (n  —  1)  +  j  then  Ct  J-  =  —  1 
else  =  0 

where  n  =  max(NuN2)  and  k  is  the  clocking  instant.  The  subcomponents  are 
connected  as  shown  in  Fig. 8. 

3.2  DHT  implementation 

We  have  seen  in  Section  2.2  that  two  dimensional  DHT  can  be  computed  using 
summation,  scaling,  transpose  and  adder  units.  The  input  x  is  formed  by  permuting 
the  columns  of  x.  The  second  summation  component  of  Fig.2A  does  not  have  the 

elements  of  S ^  embedded  in  it,  instead,  the  elements  are  those  of  the  modified 

/  / 

Snu  Stfi,  as  explained  in  Appendix  4.  Let  Tn7  be  obtained  by  permuting  the 

columns  of  Tn„  such  that  TN2(i,j)  and  TN3(i,N2  -  j)  are  adjacent  to  each  other 

for  1  <  j  <  N2  —  1.  The  third  summation  component  of  Fig.2A  has  the  elements  of 

Tn2  embedded  instead  of  those  of  T^2.  The  adder  component  in  Scheme  1  (Fig.2A) 

consists  of  Ni  subcomponents,  each  of  which  compute  either  addition  or  subtraction. 

A 

The  input  to  this  component  is  the  matrix  A  fed  in  a  skewed  manner  and  the  output 
is  the  matrix  B.  They  are  related  as  follows. 

B(i,j)  -  +  AiNx-iJ) 

B(Ni  -  i,j )  =  A(i,j)  -  A{Ni  -  i,j ) 

B(i,N2-j)  =  A(i,N2  —  j)  +  A{Ni-i,Nt-j) 

B(N\  —  t,  N2  —  j)  =  A(i,N2  —  j)  -  A(Ni  —  i,  N2  —  j) 

for  1  <  *  <  Ni  —  1  and  1  <  j  <  N2  —  1. 

B{0,j)  =  A(0,y)  for  0  <  j  <  N2  —  1  and  B(«,0)  =  A(t,0)  for  0  <  *  <  JVi  —  1.  The 
data  flow  is  shown  in  Fig.9. 

The  components  of  Scheme  2  (Fig.2B)  are  identical  to  those  of  DFT,  except  for 
the  adder  in  the  last  stage.  The  adder  component  consists  of  a  linear  array  of  N2 
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subcomponents  each  of  which  perform  the  following  operation 


where  YpR(i,j)  and  are  the  real  and  imaginary  components  of  the  DFT 

output  Yp(i,j).  The  data  flow  is  shown  in  Fig.10. 


Data  flow  through  the  *th  adder  subcomponent  of  Scheme  1 


Fig.9  Data  flow  through  the  adder  component  of  Scheme  1 


Fig.10  Data  flow  through  the  adder  component  of  scheme  2 
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3.3  DCT  implementation 

The  algorithm  for  the  DCT  implementation  can  be  split  into  two  phases,  the  first 
phase  consists  of  the  computation  of  DFT  or  DHT  on  columns  followed  by  adjust¬ 
ment  using  multipliers  and  adders  while  the  second  phase  consists  of  the  computa¬ 
tion  of  DFT  or  DHT  on  rows  followed  by  adjustment  using  multipliers  and  adders. 
The  four  schemes  are  based  on  whether  DFT  or  DHT  is  used  to  operate  on  the 
rows  and  on  the  columns.  Since  the  computation  of  DFT  and  DHT  over  rows  and 
columns  have  already  been  discussed  in  the  previous  sections,  we  discuss  here  only 
the  designs  of  the  various  multiplier-adder  units  .  We  then  proceed  to  give  a  short 
description  of  each  of  the  four  schemes. 

Multiplier-adder  type  A  : 

The  input  to  this  component  consists  of  complex  data  formed  after  computing  DFT 
on  columns.  The  function  description  of  the  tth  component  is  as  follows 
Zc  (*,  j)  =  XR  (i ,  j)cos  a  -  X/  (t ,  j)sin  a 
Zs(i,j )  =  XR(i,j)sin  a  +  Xj(i,j)cos  a , 

where  a  =  j^-ts(t)  and  Xr  and  Xj  are  the  real  and  imaginary  parts  of  the  input 
data  (see  equation  (26)).  The  skewed  data  flow  through  this  component  is  shown  in 
Fig. 11. 


Fig. 11  Data  flow  through  multipler-adder  A 
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Multiplier-adder  type  B  : 

The  input  to  this  component  is  Yu{i,j),  the  data  after  computing  DHT  on  columns. 
The  function  description  of  the  *th  subcomponent  is  as  follows 
Zc{iyj)  =  \{YH{i,j)cas{-a)  +  YH[N  -  i,j)casa) 

Zs(i,j)  -  \[YH(i,j)cas  a  -  YH(N  -  ij)cas(-oc)], 
where  a  —  ~is(i)  (see  equation  (27)).  Yjj{i,j )  and  Yjj(N  —  i,j)  are  made  adjacent 
by  modifying  5jvx  to  as  explained  in  Appendix  3.  Fig.12  shows  the  skewed  data 
flow  through  this  component. 


**0,3 

**0,2 

Ylh* 

— ► 

Y*» 

*Hi,l 

*Hi,o 

— 

**«,-!  ,1 

0 

*h, 


m  /n.° 


^  ^'"o  .2  ZCo,l  ZcQ'O 

ZSo,2  Zso,!  Z^a 

zCi,  >  ZCj,0 


Fig.12  Data  flow  through  multiplier-adder  B 
Multiplier-adder  type  C  : 

The  input  to  this  component  is  complex  data  formed  after  computing  DFT  on  rows 
along  with  a  control  line,  6,y.  The  output  is  the  real  cosine  transform  C(i,j).  The 
function  description  of  the  jth  subcomponent  is  as  follows 
When  bij  =  0 

-  XR(i,j)co$<x-  Xi{i,j)sina 
When  =  1 

C(i,j)  =  Xn(i,j)sin  a  -f  XT{i,j)cos  a, 
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where  a  =  and  Xji(i,j)  and  Xi(i,j)  are  the  real  and  imaginary  components 

of  the  input  data  (see  equations  20,  21,  26).  The  data  flow  through  this  component 
is  shown  in  Fig. 13. 


Fig. 13  Data  flow  through  multiplier-adder  C 
Multipler-adder  type  D  : 

The  input  data  to  this  component  consists  of  two  parts,  Yc(i,j),  which  was  formed 
after  computing  DHT  or  DFT  on  the  rows  of  Zc(i,j),  and  Ys{i,j),  which  was 
formed  after  computing  DHT  or  DFT  on  the  rows  of  Ys(i,j).  The  control  line  b,j  is 
also  fed  as  an  input.  The  function  description  of  the  yth  subcomponent  is  as  follows 
When  bn  =  0 

C(t,j)  =  |[Yc(t,j)cos(-a)  +  Yc{i,N2  -  j)casa  -Ys{i,j)casa 
+Ys{i,N2  -  j)cas(-a)] 

When  b{j  —  1 

C(i,j)  =  ±[Yc{i,j)cas  a  -  Yc{i,N2  -  j)cas(-a)  +  Ys{i,j)cas(-a) 

+Ys{i,N2  -j)cas  a], 

where  a  =  (see  equations  20,  21,  27).  Yc(i,j),(Ys[i,j))  and 

Yc{i,N2  -  j),(Ys(i,N2  —  /))  are  made  adjacent  by  modifying  matrix  SNi  to  Sn2- 

The  data  flow  through  this  component  is  shown  in  Fig.  14. 
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Yc0, 

Ycoa 

Ye b.. 

^C'o.o 

Y. *,3 

Yso, 

Ysoy  ' 

Tft.  o 

Ycxa 

Yci,i  - 

YCl,0 

YsJ 

Y*.  o 

Ydt Tj-1,0 

YSy^  —1,1 

Ysy^ -i,<S  . 

^CfWj/il.o 

^pvj/ji.o 


*■  Co, 3  Co,2  Co,l 

■"  Ci, 2  Ci,i  Cifl 

— “Cvj-1,1  Cn2  -1,0 


—*C  i 


r^/2i,o 


Fig. 14  Data  flow  through  multiplier-adder  D 


C0,o 


Scheme  1  DFT-DFT 

First  phase  :  The  first  summation  unit  has  elements  of  T^x  embedded,  the  second 
summation  unit  has  elements  of  embedded  and  the  scaling  unit  has  elements 
of  CNl  embedded  in  it.  The  multiplier-  adder  is  of  type  A. 

Second  phase  :  The  first  summation  unit  has  elements  of  Tn2  embedded,  the  second 
summation  unit  has  elements  of  Sn2  embedded  and  the  scaling  unit  has  elements 
of  Cn2  embedded  in  it.  The  multiplier-  adder  is  of  type  C. 

The  interconnection  of  the  various  components  is  shown  in  Fig. 15. 


Fig.15  Interconnection  for  Scheme  1 
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Scheme  2  DFT-DHT 

First  phase  :  This  is  identical  to  the  first  phase  of  Scheme  1. 

Second  phase  :  The  first  summation  unit  has  elements  of  Tjv2  embedded,  the  second 

A 

summation  unit  has  elements  of  Sn,  embedded  and  the  scaling  unit  has  elements 
of  embedded  in  it.  The  multiplier-adder  is  of  type  D. 

The  interconnection  of  the  various  components  is  shown  in  Fig. 16. 


Fig.  16  Interconnection  for  Scheme  2 


Scheme  3  DHT-DFT 

First  phase  :  The  first  summation  unit  has  elements  of  T jvj  embedded,  the  second 

A. 

summation  unit  has  elements  of  S ^  embedded  and  the  scaling  unit  has  elements 
of  CjVx  embedded.  The  multiplier-adder  is  of  type  B. 

Second  phase  :  This  is  identical  to  the  second  phase  of  Scheme  1. 

The  interconnection  is  shown  in  Fig. 17. 


20 


Fig.17  Interconnection  for  Scheme  3 
Scheme  4  DHT-DHT 


First  phase  :  This  is  identical  to  the  first  phase  of  Scheme  3. 
Second  phase  :  This  is  identical  to  the  second  phase  of  Scheme  2. 
The  interconnection  is  shown  in  Fig.  18. 


Fig.18  Interconnection  for  Scheme  4 


The  matrices  S,  C,  T  for  computing  the  DCT  for  N  =  35  ( Ni  =  7,  iV2  =  5) 
using  Scheme  4  have  been  listed  in  Appendix  5. 

Thus  we  find  that  the  element  skewed  implementation  of  DFT(N),  DHT(N), 
DCT(N)  requires  an  area  of  O(N)  and  a  computation  time  of  0(max(Ni,  N^)), 
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where  N  =  NiN2  and  Ni,  N2  are  mutually  prime.  Our  designs  are  optimal  since 
the  theoretical  lower  bound  of  AT2  =  Cl(N2)  have  been  achieved. 

4  Bit  serial  implementations 

In  this  section  we  discuss  the  synchronous  bit  serial  implementations  of  the 
components  discussed  in  Section  3.  We  chose  to  use  this  mode  because  for  data 
of  precision  p,  there  is  a  drastic  reduction  in  area  from  0(pNiN2)  to  0(N1N2)  for 
most  of  the  components.  Another  advantage  is  that  the  communication  within  and 
between  VLSI  chips  is  more  efficient.  The  disadvantage  of  having  to  clock  the  bit 
serial  version  p  times  more  can  be  taken  care  of  by  clocking  at  a  faster  rate,  since 
the  basic  subcomponents  are  much  simpler.  We  have  chosen  the  binary  left  directed 
LSB  to  MSB  technique  [JK]  as  against  the  higher  base  right  directed  MSB  to  LSB 
technique  [OJ],  since  the  circuitry  of  the  subcomponents  is  of  lesser  complexity  . 
The  larger  delay  in  the  scaling  subcomponent  is  compensated  by  a  smaller  delay  in 
each  of  the  summation  subcomponents  and  a  higher  clock  rate. 

4.1  DFT  implementation 

In  this  section  we  discuss  briefly  the  bit  serial  versions  of  the  summation,  scaling 
and  transpose  components  as  in  [OJ],  [JK]  but  in  the  light  of  a  different  data  flow 
format.  Every  element  of  data  of  precision  p,  is  followed  by  a  “0”  word,  which  is 
a  string  of  p  zeros.  This  is  necessary  since  2p  bits  are  generated  by  the  bit  serial 
multiplier  and  unless  the  data  is  appended  by  a  “0”  word,  the  results  would  be  erro¬ 
neous.  We  can  thus  think  of  every  data  to  be  a  2p  bit  word  now.  The  control  input 
rin  would  indicate  that  the  first  bit  of  an  element  is  being  supplied.  Thus  rtn  =  1  for 
the  least  significant  bit  of  the  2p  word  and  is  0  otherwise.  We  choose  to  propagate 
the  real  and  the  imaginary  parts  of  the  data  along  the  same  line  inorder  to  further 
reduce  the  size  of  each  subcomponent.  The  sequence  is  as  follows,  the  2 p  bit  real 
part  followed  by  the  2p  bit  imaginary  paxt.  The  control  input  indicates  whether 


the  data  is  real  or  imaginary.  <?,•„  =  0  for  the  real  part  and  is  1  for  the  imaginary  part. 

0  ...  0  (x/.-Jp-!  ...  (x/.-Ji  (x/t.y)0  0  ...  0  (x*fy)p_ !  ...  (xflfy)i  (xRi.y)o  Xin 

0  ...  0  0  ...  0  1  0  ...  0  0  ...  0  1  rin 

1  ...  1  1  ...  1  1  0  ...  0  0  ...  0  0  qin 

Summation  component  :  The  functional  definition  of  the  bit  serial  summa¬ 

tion  subcomponent  is  as  follows. 

Xout  =  X,'n 

T  out  =  ?in 

Qout  ~  Qotit  —  Qin 

if  qin  =  qin  then 
if  nn  =  1  then 

Zout  —  %in  Xjn  2couf 

else 

Zout  ~  Zin  Xin  "b  C»'r»  2 Cout 

endif 

endif 

Since  the  delay  in  the  summation  subcomponent  is  only  one,  the  elements  are  fed 
in  1  bit  skewed  manner. 

Scaling  component  :  The  scaling  subcomponent  is  essentially  a  multiplier  fol¬ 
lowed  by  a  shift  register  so  that  only  the  higher  order  bits  can  be  retained.  There 
are  two  types  of  subcomponents  depending  on  whether  C„  is  real  or  imaginary.  The 
subcomponent  corresponding  to  an  imaginary  Cu  consists  of  additional  circuitry  so 
that  the  data  in  the  real  and  the  imaginary  parts  can  be  swapped.  This  is  pos¬ 
sible  by  having  an  additional  2 p  bit  shift  register.  To  maintain  uniformity  in  the 
bit  skewness  of  the  data,  the  subcomponent  corresponding  to  a  real  Cu  has  to  be 
equipped  with  a  2 p  bit  shift  register  too.  The  functional  definition  of  the  scaling 
subcomponent  is  as  follows. 
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Xout  —  Xin 
f  out  =  fin 
Qout  =  Gin 

if  r,n  =  1  then 

Zout  ~  Xiny  2 cout 

else 

Zout  =  Xiny  +  Z{n  +  Cm  2  C0ut 

endif 
Cout  ~  Gin 

Transpose  component  :  This  component  consists  of  an  array  of  shift  registers 

of  length  2p  —  2.  The  area  of  this  subcomponent  is  thus  a  function  of  p.  The 
functional  definition  of  the  shift  register  SR  of  the  tjth  subcomponent  is  as  follows. 

Xout  =  !/out  =  SR{0] 

for  *  =  0, 1, ...» 2p  —  3 
SR[i]  =  SR[i  +  1] 
if  an  >  0  then 

SR[2p  2]  =  Xin 

else 

SR[2p  —  2]  =  y^ 
endif 

The  control  input  at  the  kth  clock  unit  is  as  follows 
if  k>2{i  +  j)  -{2p-2)Ni 
=  1 

else 

=  -1 

endif. 
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4.2  DHT  implementation 

In  this  section  we  discuss  the  data  flow  formats  and  the  bit  serial  implementation 
of  the  adder  units  of  Schemes  1  and  2  of  DHT.  The  data  in  Scheme  1  is  of  the  form 
as  shown  below. 


0  ...  0  (Xijijp— i  ...  (2'i,yi)o  0  •••  0  [xi,j)p-l  •••  (*<,;)  1  Xin 

0  ...  0  0  ...  0  1  0  ...  0  0  ...  0  1  rin 

We  do  not  need  the  control  input  since  the  data  is  always  real.  The  elements  are 
1  bit  skewed.  The  bit  serial  summation  and  the  transpose  components  are  identical 
to  those  of  DFT.  Since  both  the  data  and  Cu  are  always  real,  the  scaling  component 
performs  only  real  multiplications.  No  additional  shift  registers  are  required.  The 
scaling  component  is  not  only  smaller,  but  the  delay  for  every  element  passing 
through  it  is  2 p  less. 

The  adder  unit  in  Scheme  1  consists  of  [Ni/2\  components,  each  of  which 
consists  of  2 p,  4 p  shift  registers,  an  adder,  a  subtractor  and  a  couple  of  delay  units. 

The  data  flow  through  the  *th  component  is  as  follows. 


0  Ai'Nz-j  0  AiJ 


0  0  AjVi  -i,j 


0  j  0  Bij 

0  0  -HjVi-i.j 


The  data  flow  through  the  various  components  in  Scheme  2  is  identical  to  that 
of  DFT.  The  adder  unit  consists  of  N2  subcomponents,  each  of  which  consists  of  2p 
bit  shift  registers  and  subtractors.  The  data  flow  through  the  j th  subcomponent  is 
as  follows. 
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0 


0 


0 


0 


4.3  DCT  implementation 

In  this  section  we  discuss  the  data  format  and  the  bit  serial  implementation  of  the 
different  multiplier-adder  units  of  DCT.  The  format  of  the  data  is  identical  to  that 
of  DFT.  After  the  first  phase,  the  data  consists  of  the  cosine  part  Xc  and  the  sine 
part  Xs •  We  can  think  of  Xc  to  be  equivalent  to  the  real  part  XR  in  DFT  and  Xs 
to  be  equivalent  to  the  imaginary  part  Xj  in  DFT.  Here  the  control  input  qin  would 
indicate  whether  the  data  is  the  cosine  part  or  the  sine  part,  qin  =  0  for  the  cosine 
part  and  is  1  for  the  sine  part.  The  input  data  X  is  always  real.  Thus  in  Schemes 
1  and  2,  which  compute  DFT  first,  every  element  of  the  input  data  is  followed  by 
three  “0”  words. 

0  ...  0  0  0  ...  0  0  0  ...  0  0  ( Xij)p-i  ...  (x,y)i  (0^)0 

0  ...  0  0  0  ...  0  1  0  ...  0  0  0  ...  0  1 

1  ...  1  1  1  ...  1  1  0  ...  0  0  0  ...  0  0 


In  Schemes  3  and  4,  which  compute  DHT  first,  the  input  data  occurs  in  both 
the  sine  and  the  cosine  part.  This  guarantees  simpler  circuitry  and  less  delay  in  the 
multiplier-adder  B. 


0 

...  0 

0 

(rc,j)p_i 

...  (x,y)  1  (z,j)0  0 

...  0 

0 

(x»j)p— 1 

.  .  .  (Xij )  1  (xij) 

0 

...  0 

0 

0 

0 

1  0 

...  0 

0 

0 

0  1 

1 

...  1 

1 

1 

1 

1  0 

...  0 

0 

0 

0  0 

The  multiplier-adder  units  are  a  combination  of  summation,  scaling  subcom- 

*> 

ponents  timed  properly  with  the  help  of  shift  registers  and  regulated  with  the  control 
inputs.  We  shall  not  discuss  the  functional  definition  of  the  subunits  here.  Other 
than  the  multiplier-adder  units,  the  rest  of  the  components  in  the  various  schemes 


'Mn 

Tin 

Qin 


are  either  identical  to  those  used  in  DFT  or  to  those  used  in  DHT.  Multiplier-adders 
C  and  D  need  a  control  input  b^.  Since  Ni  and  iV2  are  fixed,  the  value  of  b^  for 
0  <  *  <  Ni  —  1  and  0  <  j  <  JV2  —  1  are  known  beforehand.  These  values  can  be 
stored  in  iV2  iVi-bit  shift  registers.  The  clock  would  be  an  AND  of  the  system  clock 
with  rin<fa. 

Multiplier-adder  A  :  The  *th  subcomponent,  1  <  i  <  N\  —  1,  would  con¬ 

sist  of  two  fixed  serial  multipliers,  cos  -j^is(i)  and  sin  jj- is(i ),  2 p  bit  shift  registers, 
an  adder,  a  subtractor  and  switches  based  on  the  control  input  gtn.  The  data  flow 
through  this  component  is  as  follows.  Zc,  Zs,  Xr,  Xi  are  defined  as  in  Section  3.3. 


0 


0 


Zs 


0  ZCi,j 


Multiplier-adder  B  :  The  t'th  subcomponent,  1  <  i  <  [JVi/2j,  would  consist  of 

four  fixed  serial  multipliers,  cas  (— ^is(*)),  cas  (]^*«(*)),  cas  (— ?ra(t)  +  ~is(i)j , 
cas  ^7rs(i)  —  -j^is(ifj ,  switches  based  on  qin,  adders,  subtractors  and  some  simple 
logic  circuitry.  The  data  flow  through  this  subcomponent  is  as  follows.  Zc,  Zs,  Yr 
are  defined  as  in  Section  3.3. 


0  0 


0  YjiN^_ij  o  -i,j 


0  ZSi.  o  Zcij 


0  ZsNl_tj  0  ZcNl_{j 


Multiplier-adder  C  :  The  jth  subcomponent,  1  <  j  <  iV2  —  1,  consists  of  two 

fixed  serial  multipliers,  cos  and  sin  ^ jt(j ),  2p  bit  shift  registers,  an  adder, 

subtractor  and  some  simple  switches  based  on  the  control  inputs  q,n  and  bij.  The 

a 

data  flow  through  this  subcomponent  is  as  follows.  Xr,  Xj,  C  are  as  defined  as  in 
Section  3.3. 
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Multiplier-adder  D  :  The  jth  subcomponent,  1  <  j  <  N2-l,  consists  of  two 

fixed  p  bit  serial  multipliers,  cas  {—fcMjj)  and  cas  ’  adders,  subtractors 

and  some  simple  switches  based  on  qin  and  6i;-.  The  data  flow  through  this  subcom¬ 
ponent  is  as  follows.  Yc,  Ys,  C  axe  as  defined  as  in  Section  3.3. 

0  Ysij  0  Ycij  —*■ 

— '  Ci,j  0  0  0 

0  Ysi'N2-j  0  YciN2_3  — ► 

Thus  we  find  that  the  bit  skewed  implementation  of  DFT(N),  DHT(N),  DCT(N) 
requires  an  area  of  0(pN )  and  a  computation  time  of  0(max(Ni,  JV2)p),  where  p  is 
the  precision,  N  =  NiN2  and  Ni,  N2  are  mutually  prime.  Note  that  these  designs 
achieve  the  optimal  time  performance  for  bit  serial  systolic  architectures. 

5  Conclusion 


In  this  report  we  have  presented  two  dimensional  schemes  for  computing  DFT, 
DHT,  DCT  based  on  the  so  called  small  n  algorithms,  their  systolic  word  and 
bit  serial  array  implementations.  The  section  on  DFT  exists  in  the  literature. 
We  included  it  here,  since,  we  needed  some  of  the  components  for  our  designs  of 
DHT  and  DCT.  Though  the  two  dimensional  formulation  of  DHT  is  not  a  new 
concept,  detailed  VLSI  design  to  compute  it  does  not  exist  in  the  literature.  We 
have  improved  upon  the  method  suggested  by  Sorensen  et.  al.  [SJBH]  and  have 
given  a  fairly  detailed  description  of  the  various  components.  We  have  formulated 
the  mapping  from  one  dimensional  DCT  into  the  two  dimensional  form.  We  have 
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proposed  four  schemes  based  on  whether  DFT  or  DHT  is  computed  on  the  rows 
and  on  the  columns. 

We  claim  that  Scheme  1  for  the  computation  of  DHT  is  superior  to  Scheme 
2.  This  is  because  the  data  in  Scheme  1  is  always  real  and  hence  the  associated 
circuitry  is  simpler.  The  delay  in  the  scaling  and  adder  units  is  significantly  less. 
In  the  bit  serial  mode,  2p  bits  are  necessary  to  describe  each  element  as  against  4 p 
bits  in  Scheme  2.  Thus  the  overall  delay  in  scheme  1  is  half  of  that  of  Scheme  2. 

A  comparison  of  the  four  schemes  for  DCT  leads  us  to  claim  that  scheme  4, 
that  is,  DHT-DHT  is  the  best.  The  reasons  are  as  follows.  First  of  all,  since  the 
data  is  always  real,  its  representation  in  terms  of  bits  as  well  as  control  inputs  is 
simpler.  Secondly,  the  circuitry  is  less  complex.  Thirdly,  the  delay  is  much  less.  For 
instance,  in  the  bit  serial  mode,  the  delay  is  2p  bits  less  for  every  element  through 
each  of  the  scaling  components  as  well  as  the  multiplier- adder  units. 

In  the  formulation  for  DCT,  we  have  assumed  that  Nx  and  N2  are  relatively 
prime.  A  practical  case  would  be  when  Nx,  N2  «  y/N^N^.  We  find  that  when 
N2  —  N\  +  1  or  when  Nx  =  N2  +  1,  there  is  a  remarkable  reduction  in  the  hardware 
needed  in  the  multiplier-adder  units.  In  the  case  when  N2  =  Nx  + 1,  s(kx)  =  1/2  for 
all  kx.  This  implies  that  Xc{Nx  -  kx)  =  Xs{kx )  and  Xs{Nx  -  kx)  =  Xc(fci).  The 
number  of  multipliers  in  multiplier-adder  A  and  B  is  thus  reduced  by  half. 

An  estimate  of  the  sizes  of  the  various  components  used  to  compute  the  DCT 
of  208  ( Nx  =  16,  N2  =  13)  points  with  16  bit  precision  using  Scheme  4  is  as 
follows.  The  technology  is  CMOS.  In  the  first  phase,  the  summation  components 
consist  of  288  subcomponents  each  of  size  approximately  80A  x  90A  with  a  delay 
of  at  most  5ns.  The  scaling  component  consists  of  18  subcomponents  each  of  size 
approximately  1200A  x  1200A.  The  multiplier-adder  subcomponent  consists  of  16 
subcomponents  of  size  2500A  x  2500A.  Though  each  of  these  subcells  work  correctly 
for  a  clock  period  of  20ns,  we  choose  40ns  for  reliable  operation.  The  throughput 
is  20,000  of  such  DCTs  per  second. 
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Appendix  1 


Part  A  : 


H(ki,k2 )  =  -^y2'^2x(n1,n2)cas^rnk 
V  N  n2  n\  ^ 


(1) 


«2  ni 

1  .  27T 


=  —7=Y'Y^x(ni,n2)cas—(KiK3n1ki  mod  N  +  K2K4n2k2  mod  N 

V  n2  nx  ^ 

+  KiK^n\k2  mod  N  +  K2Kzn2k\  mod  N  —  aN),  where  a  =  0  or  1  (2) 

In  order  for  two  dimensional  nesting  to  be  possible,  the  constants  K\,  K2,  K$, 
Ka  have  to  be  chosen  such  that  K\K\n\k2  mod  N  and  K2K^n2ki  mod  N  are  zero. 
Ki  =  aN2  and  K4  =  8N1  guarantee  the  first  term  to  be  zero.  Since  Ni  and  N2  are 
relatively  prime  to  each  other,  we  can  choose  K2  =  f3N\  and  K3  —  jN2  and  make 
the  second  term  zero.  Expression  (1)  then  reduces  to 

1  2i7T 

x(nx,n2)cas  —  ( aryiVfnifci  mod  N  +  /38Nfn2k2  mod  JV) 

V  ™  no  ni  ** 


(3) 


By  choosing  a  =  /?  —  1,  7  =  N2X  mod  N\,  8  =  N[  1  mod  N2,  we  get 

/\  1  ^  27T 

H(ki,k2)  =  —7=  y;  y;  x(ni,  n2)cas  —  {N2nxkx  mod  N  +  Nxn2k2  mod  N) 

v  N  n 2  ni  ™ 


=  ^§?"(ni,n2)c"(^niA:i  +  ^2) 

=  -^J2^J2Hni’n2)casJ^niki^cas^-n2k2 

2  jxi  x(n1,n2)sin  sin  n2k2 


(4) 
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Part  B  : 


H(k i,  k2)  can  also  be  expressed  as 


H(ki,  k2)  = 

+^=  J2  jll  x(nun2)sin  J  cas  (~^n2^) 

1  27T 

LetA(A?i,  n2)  =  . —  Y^  3r(ni,  n2)cas  —  ni&i,  then 

V-W  •  iVa 


„  1  ^2-1  27T 

H(ki,  k2)  =  A(&i,0)  +  -  YZ  {A(k1,n2)  +  A(Ni  -  ki,n2)}cas—  n2k2 

1  n2=l  iV2 

1  JVa_1  27T 

+  x  YZ  {-A(fci,n2)  -  A(JVi  -  &i,n2)}ca.s  -  -~n2k2 

1  n2=l  ^2 

1  JV2-1  27T 

=  A(&1;0)  +  -  5^  {A(&i,n2)  +  A(iVi  -  fci,n2)}cas  — -n2&2 

1  n2= 1  ^2 

1  iV2_1  27T 

+  0  S  {^4(A:i,  iV2  -  n2)  -  A(iVx  —  Ara ,  iV2  —  n2)}cas  — (n2  -  N2)k2 

1  AT2-n2=l  ^2 

1  JVa-l 

==  A(^,0)  +  —  Y"1  {A(fci, n2)  +  A(Nj  —  fcx,  re2)  +  A(fcx,  iV2  —  re2) 

Z  n2=l 

27T 

— A(iVx  —  ki,N2  —  n2)}cas  —n2k2 

Jy2 

N2~l  2ir 

=  22  B{kun2)cas —n2k2,  (6) 

n2 =0  iV2 

where  5(0,  n2)  =  A(0,n2)  for  0  <  rc2  <  N2  —  1, 


5(A:X,0)  =  A(fcx,0)  for  0  <  ki  <  Ni  —  1  and 

B{ku  n2)  =  i[A(fci,n2)  +  A(iVx  -  fcx,n2)  +  A{kuN2  -  n2)  -  A(iVx  -  &x,iV2  -  n2)], 
for  1  <  fcx  <  N\  —  1  and  1  <  n2  <  iV2  —  1. 
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Appendix  2 

C(ki,k2)  =  J2  E  V(n i ,  n2)co5  +  “ ^] 

2  r27T 

=  77  EE^ni>n2)c“  — (isTii^nifei  mod  N  +  I<2K4n2k2  mod  N 

^  W2  n\  **  ™ 

+  KxK4nxk2  mod  N  +  K2K3n2kx  mod  N  —  aN ) 

+  ^(Kzh  mod  N  +  K4k2  mod  N  -  fc^iV)]  (7) 

where  a  is  a  positive  integer  constant.  Two  dimensional  nesting  is  possible  only 
if  the  cross  terms,  namely,  KxK4nxk2  mod  N  and  K2K3n2kx  mod  N  are  zero.  A 
choice  of  Kx  =  aN2  and  K4  =  8NX,  where  a  and  8  are  positive  integer  constants 
guarantee  the  first  term  to  be  zero.  Since  Nx  and  N2  are  mutually  prime,  the  second 
term  can  be  made  zero  by  choosing  K2  —  /3NX  and  K3  =  7^2,  where  (3  and  7are 
positive  integer  constants.  Equation  (7)  then  reduces  to 
2  r27r 

C{k  1,  k2 )  =  —  E  y(ni->n2)cos  —(a^Nlnxkx  mod  N  +  /38N*n2k2  mod  N  -  aN) 

712  **  * 

+  ~^("fN2kx  mod  N  +  8Nxk2  mod  N  -  bklk2N) 

=  T7  E  E y(ni,n2)cos  —{a^Nlnxkx  +  (38Nfn2k2  -  aN) 

^  ’  712  ni  ** 

+  ■^('yN2kx  mod  N +  8Nxk2  mod  N -bklkiN)  ,  (8) 

where  a  is  an  integer  constant.  We  define  two  functions  s(kx)  and  t(k2)  such  that 
s(kx)  =  (-yN2kx  mod  N)/(2N2kx)  and  t(k2)  =  (8Nxk2  mod  N)/(2Nxk2).  It  is  obvi¬ 
ous  that  s(kx)  and  t(k2)  can  be  predetermined.  We  choose  a  —  f}  =  1, 

7  =  -^2_1  m°d  Nx,8  =  N^  mod  N2.  On  substituting  these  values  into  equation  (8) 
we  get 

C(h,h)  =  + 

+  2jyr(2-^2^i>s(^i)  +  2Nxk2t(k2)  —  bklk2N) 

2  _ ^  ^  r  7j-  Tj. 

=  TfEE^K’^cos  —  (2nx  +  s(kx))kx  +  —  (2n2  +  t(k2))k2  -  ^bhk2  . 

rj  ni  '-■‘“l  iv2  Z 

(9) 
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Appendix  3 


One  dimensional  DHT  can  be  expressed  by 

H  =  SnCnTnx ,  (10) 

where  Sn,  Cjv,  Tjv  are  defined  as  in  section  2.2.  Let  H  be  a  permutation  of  H,  such 
that  H  =  (H(0)H(1)H(N  -  1)  •  •  •  H(k)H(N  -  k)---)T ,  1  <  Ar  <  |_iV/2j,  and  N  is 
odd.  H  is  thus  equal  to  PH,  where  P  is  a  permutation  matrix  with  the  following 
characteristics  : 

For  odd  i,  P(i,  fi/2])  =  1 

For  even  i,  and  i  ^  0,  P(i,  N  —  i/2)  =  1 

P(0,0)  =  1. 

H  can  thus  be  expressed  as 

H  —  SnCnTnx ,  (11) 

where  =  PSn.  To  get  the  output  in  the  form  of  H,  all  we  need  is  to  modify 
the  matrix  Sn  to  Sm-  Since  both  P  and  Sn  are  known  beforehand,  Sn  can  be 
precomputed.  This  modification  results  in  an  entire  row  getting  permuted  without 
changing  the  order  of  the  elements  in  a  row.  So,  we  do  not  need  to  modify  Tjv  in 
the  computation  of  DHT  of  rows. 

Example  : 

f  1  ...  . 

.  1  .  .  . 

For  N  =  5,  P  takes  the  following  form  ....  1 

.  .  1  .  . 

V  ...  1  . 
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Appendix  4 


Let  A  =  Snx  C nx  Tn,x. 

In  the  first  step  we  have  to  ensure  that  A(&i,  iV2 ),  A(N\  —  &i,n2),  A(&i,fV2  —  n 2), 
A(iVx  —  k\ ,  N2  —  n2)  will  be  adjacent  to  each  other.  This  can  be  made  possible  by 

1.  permuting  the  columns  of  x  such  that  x(k\,n2)  and  x(ki,N2  —  n2)  are  adjacent 
to  each  other,  for  1  <  n2  <  iV2  —  1. 

2.  replacing  Sn  by  Sjv  as  explained  in  Appendix  3. 

Let  A(Ari,n2)  =  A(ki,n2)  +  A(Ni  —  fci,n2)  and 

A(Nx  -  ki,n2)  =  A(A?i,n2)  -  A(Ni  -  &i,n2),  for  1  <  h  <  [_iV1/2j .  This  is  for  odd 
Ni.  For  even  iVi,  A(N\/2,  ra2)  =  A(iV1/2,  n2).  A  can  thus  be  expressed  as  A  =  PA, 
where  A  and  A  are  the  two  matrices  defined  above.  The  matrix  P  has  the  following 
characteristics  : 

For  odd  i,  P(i ,  i)  =  1;  P(i,  i  +  1)  =  1. 

For  even  i  and  i  ^  0,  P(i,i )  =  —1;  P(i,  i  —  1)  =  1. 

P(0,0)  =  1. 

Thus  A  =  PSnx C Tn, x  =  SnxCnxTn Xx,  where  Sn,  =  PSn, ■  Since  P  is  known 
/ 

beforehand,  Snx  can  be  precomputed.  Thus  A(ki,n2 )  +  A(iVx  —  ki,n2), 

A(k\,  n2)  —  A(Ni  —  k\,  n2),  A(k\,  iV2  —  n2)  +  A(N\  —  &i,  iV2  —  n2)  and 
A(ki,  N2  —  n2)  —  A(Ni  —  ki,  iV2  —  n2)  are  adjacent  to  each  other. 

Example  : 


/  1 


For  N  =  5,  P  takes  the  following  form 


.  \ 

1 

-1 
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Appendix  5 


We  illustrate  Scheme  4  of  DCT  (  DHT-DHT  )  with  the  help  of  an  example. 
Let  N  =  35  with  Ni  =  7  and  JV2  =  5.  The  matrices  SV,  CV,  TV,  S5,  C5,  T5  can 
be  derived  from  the  small  n  algorithms  (n  =  5, 7).  C  is  obtained  by  negating  the 
imaginary  entries  of  C ,  S  is  obtained  by  permuting  the  rows  of  5  as  explained  in 


Appendix  3. 
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